TextRank Algorithm for Key Phrase Extraction / Text Summarization.
Problem Statement¶
Suppose you are working for a newspaper publisher and you are assigned the task of tagging each article using text mentioned in the articles. You download the corpus of data and realise that there are more than 1 million articles to be tagged.
You have two options to choose from -
- Manually tagging articles
- Designing an Automated System for phrase extraction.
An attempt to manually accomplish this task has the following disadvantages -
- Manual processes are subjective with hardly any quality control.
- The sunk cost (money plus time) associated with the manual exercise is huge.
- A manual extraction model is not scalable. Under time constraints, the only option available will be to add more resources, which will diminish the returns on the project.
- The results will not be reproducible.
It's evident that the manual process has a number of disadvantages.
This leaves us with option 2 of 'Designing an Automated System for phrase extraction'.
How can we automate the process?
To accomplish the task we need an extractive NLP (Natural Language Processing) model which can do the job for us at scale. TextRank is inspired by Google's PageRank algorithm, which leverages Markov chains for ranking web pages. The PageRank algorithm is named after Google co-founder Larry Page. It outputs a probability distribution representing the likelihood that a person randomly clicking on links will arrive at any particular page.
In this article, we shall extend the PageRank algorithm to textual data and implement it from scratch in Python.
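To make the idea concrete before touching text, here is a minimal, hypothetical power-iteration sketch of PageRank on a toy link graph (the node names and damping factor are illustrative only, not part of the news data):
import numpy as np

# Toy link graph (hypothetical pages): each page lists the pages it links to.
links = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A']}
nodes = sorted(links)
n = len(nodes)
damping = 0.85  # standard PageRank damping factor

# Column-stochastic transition matrix of the "random surfer".
M = np.zeros((n, n))
for j, src in enumerate(nodes):
    for dst in links[src]:
        M[nodes.index(dst), j] = 1.0 / len(links[src])

# Power iteration: repeatedly redistribute rank until it stops changing.
rank = np.full(n, 1.0 / n)
for _ in range(100):
    new_rank = (1 - damping) / n + damping * M.dot(rank)
    if np.abs(new_rank - rank).sum() < 1e-8:
        break
    rank = new_rank

print(dict(zip(nodes, np.round(rank, 3))))
TextRank replaces web pages with word tokens and hyperlinks with co-occurrence edges; the resulting steady-state probabilities then act as token importance scores.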
Pre-requisites -¶
Before we jump into the task at hand, please refer to the following articles to brush up on the theory behind the basic building blocks of the PageRank algorithm.
Data -¶
We shall be using the "all-the-news" dataset published on Kaggle.
The description of the dataset is as follows -
- unnamed - index column
- id - Database ID
- title - Article title
- publication - Publication name
- author - Author name
- date - Date of publication
- year - Year of publication
- month - Month of publication
- url - URL for article (not available for all articles)
- content - Article content
Let's start by importing the Python libraries that we will need throughout the process.¶
### Import python libraries
import warnings
warnings.filterwarnings('ignore')
import os
import string
from collections import defaultdict
import itertools
import operator
import networkx
import nltk
from nltk import sent_tokenize, word_tokenize
import numpy as np
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "none"
from IPython.display import display
from IPython.display import HTML
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))
display(HTML("<style>.container { width:100% !important; }</style>"))
raw_data = pd.read_csv('./data/news_articles/all-the-news/articles1.csv', encoding='utf-8',
                       dtype={'content': str})
display(raw_data.head())
For the sake of convenience, we shall work with the first 10 articles from the corpus.¶
Typically, more data would help in estimating accurate rank scores, but this article is for demonstration purposes only. Since article categories are missing from the original data, one could instead select articles by first running an unsupervised clustering algorithm and training the PageRank network per cluster; clustering would ensure that the articles in each cluster are topically coherent, resulting in a more robust proof of concept, as sketched below.
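For completeness, a rough sketch of such a clustering step is shown below. This assumes scikit-learn is available; the vectorizer settings and cluster count are arbitrary illustrations, and this step is not used in the rest of the article.
# Sketch only: cluster articles by TF-IDF similarity, then sample within a cluster.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(raw_data['content'].fillna(''))

kmeans = KMeans(n_clusters=20, random_state=42)
raw_data['cluster'] = kmeans.fit_predict(tfidf_matrix)

# e.g. work with articles from a single, topically coherent cluster
coherent_sample = raw_data[raw_data['cluster'] == 0].head(10)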
data = raw_data.head(10).copy()
Before we dig into the implementation, let's document a recipe / pseudo code which will help us track our progress -¶
Step 0 - Load data (Duh !!!)
Step 1 - Inspect and clean raw data. Free-flowing textual data is always subject to noise which can affect the model training process. Hence it is imperative to perform pre-processing to ensure robust feature extraction. Please note that the cleaning techniques can be modified and made more elaborate depending on the quality of data.
Step 2 - Candidate key-phrase extraction. For the sake of convenience, we shall use a simple regex chunking technique to extract potential candidate phrases, which will then be ranked using the TextRank algorithm. Please refer to this for an overview of phrase extraction; the article covers unsupervised as well as supervised techniques that can be used to extract and rank phrases.
Step 3 - Node Extraction. This is an important step in building a textrank model. Instead of including all the tokens of the cleaned data, we shall ensure that only valid and useful tokens are added as nodes.
Step 4 - Extract bigrams and build a networkx graph. Using the bigrams from the cleaned text and the tokens from Step 3, build a graph using networkx. One can explore the other types of graphs supported by networkx in the official documentation.
Step 5 - Train the textrank model. Training a textrank/pagerank model is an iterative process which estimates the steady state probabilities. These steady state probabilities for each node reflect the importance of that particular token in the context of the corpus.
Step 6 - Aggregate scores for the potential candidates from Step 2. In this article, we take a simple sum of the scores of all the tokens in a candidate phrase, as ranked by the TextRank algorithm. This technique has an obvious disadvantage: longer phrases tend to come out as winners, which may not be ideal for certain applications.
Step 7 - Scoop out the Creamy Layer. This step is optional and subject to the domain of application.
Step 8 - Reverse map the shortlisted key phrases to the original articles.
Step 9 - Inspect results.
Let's start with Step 1 of basic text pre-processing. We shall clean the contents of the news articles by -¶
- Removing punctuation except for full stops.
- Converting sentences to lower case.
punctuations = set(string.punctuation)
punctuations_map = dict((ord(char), None) for char in punctuations)

def clean_text(raw_text):
    # Split into sentences so that sentence boundaries survive punctuation removal.
    sentences = sent_tokenize(raw_text)
    # Strip all punctuation and normalise whitespace within each sentence.
    punctuation_free_sentences = [" ".join(raw_string.translate(
        punctuations_map).split()) for raw_string in sentences]
    # Re-join the sentences with full stops and lower-case the result.
    concatenated_sentences = ". ".join(punctuation_free_sentences)
    return concatenated_sentences.lower()

data.loc[:, 'cleaned_text'] = data.loc[:, 'content'].apply(clean_text)
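A quick sanity check of the cleaner on a made-up sentence (the expected output in the comment is what the function should produce, not text from the corpus):
# Hypothetical example sentence, not taken from the corpus.
print(clean_text("Republicans, however, won't stop: the lawsuit continues!"))
# expected: republicans however wont stop the lawsuit continues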
Step 2 - Potential Candidate Extraction.¶
Next, we shall extract candidate phrases using a regex chunker. The regex chunker is designed to extract noun phrases from the corpus of text.
For more details on how to design a regex chunker, please refer to the following resources -
def get_chunks(text, chunker, lang='en'):
    # POS-tag every sentence, then run the regex chunker and flatten the
    # resulting parse trees into (word, pos, chunk-tag) triples.
    tagged_sents = nltk.pos_tag_sents(
        word_tokenize(sent) for sent in sent_tokenize(text))
    all_chunks = list(
        itertools.chain.from_iterable(
            nltk.chunk.tree2conlltags(
                chunker.parse(tagged_sent)) for tagged_sent in tagged_sents))
    return all_chunks
def mine_potential_phrases(
        text,
        grammar=r'KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}',
        lang='en'):
    stop_words = set(nltk.corpus.stopwords.words('english'))
    chunker = nltk.chunk.regexp.RegexpParser(grammar)
    all_chunks = get_chunks(text, chunker, lang)
    # Join constituent chunk words into a single chunked phrase.
    candidates = [' '.join(word for word, pos, chunk in group).lower()
                  for key, group in itertools.groupby(
                      all_chunks, lambda word_pos_chunk: word_pos_chunk[2] != 'O')
                  if key]
    # Drop candidates that are stopwords or consist purely of punctuation.
    return [cand for cand in candidates
            if cand not in stop_words
            and not all(char in punctuations for char in cand)]
def build_candidate_phrases(df, col_name, lang='en'):
    # Count how often each candidate phrase appears across all articles.
    unique_candidate_chunks = defaultdict(int)
    potential_chunks = df[col_name].apply(mine_potential_phrases)
    for candidate_list in potential_chunks:
        for single_candidate in candidate_list:
            unique_candidate_chunks[single_candidate] += 1
    return list(unique_candidate_chunks.keys())

unique_candidate_phrases = build_candidate_phrases(data, 'cleaned_text')
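It is worth eyeballing a handful of the mined candidates before moving on; the exact phrases will, of course, depend on the sampled articles.
# Quick sanity check; actual values depend on the sampled articles.
print(len(unique_candidate_phrases))
print(list(unique_candidate_phrases)[:10])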
Step 3 - Node Extraction -¶
In order to build a TextRank graph, we need to mine potential nodes; the graph is then trained to estimate each node's steady-state probability.
def mine_node_candidates(text, lang='en', valid_pos_tags=set(
        ['JJ', 'JJR', 'JJS', 'NN', 'NNP', 'NNS', 'NNPS'])):
    # Translation table that deletes all punctuation characters.
    puncts_trans = str.maketrans('', '', string.punctuation)
    stop_words = set(nltk.corpus.stopwords.words('english'))
    tagged_words = itertools.chain.from_iterable(nltk.pos_tag_sents(
        word_tokenize(sent) for sent in nltk.sent_tokenize(text)))
    # Keep lower-cased adjectives and nouns that are neither stopwords
    # nor made up entirely of punctuation.
    candidates = [word.translate(puncts_trans).lower()
                  for word, tag in tagged_words
                  if tag in valid_pos_tags
                  and word.lower() not in stop_words
                  and not all(char in punctuations for char in word)]
    return candidates
def build_candidate_nodes(df, col_name, lang='en'):
    # Count token frequency and return the unique tokens as a set so that
    # membership checks while building the graph are cheap.
    unique_candidate_nodes = defaultdict(int)
    potential_nodes = df[col_name].apply(mine_node_candidates)
    for candidate_list in potential_nodes:
        for single_candidate in candidate_list:
            unique_candidate_nodes[single_candidate] += 1
    return set(unique_candidate_nodes.keys())

node_candidates = build_candidate_nodes(data, 'cleaned_text')
Step 4 - Generate N-grams and build a networkx graph¶
def generate_n_grams(data, col_name, lang='en'):
    # Collect unique bigrams (pairs of adjacent tokens) across all articles.
    unique_ngrams = defaultdict(int)
    for cand_text in data[col_name]:
        potential_ngrams = nltk.bigrams(word_tokenize(cand_text))
        for single_ngram in potential_ngrams:
            unique_ngrams[single_ngram] += 1
    return list(unique_ngrams.keys())

n_grams = generate_n_grams(data, 'cleaned_text')
def construct_textrank_graph(bigram_candidates, candidate_words):
    # Directed graph: nodes are candidate tokens, edges follow token order.
    textrank_graph = networkx.DiGraph()
    textrank_graph.add_nodes_from(candidate_words)
    # Add an edge for every bigram whose two tokens are both valid nodes.
    for w1, w2 in bigram_candidates:
        if w1 in candidate_words and w2 in candidate_words:
            textrank_graph.add_edge(w1, w2)
    return textrank_graph
textrank_graph = construct_textrank_graph(n_grams, node_candidates)
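Before training, a quick look at the size of the graph helps confirm that nodes and edges were actually added (the exact counts depend on the sample).
# Sanity check on graph size; exact counts depend on the sampled articles.
print(textrank_graph.number_of_nodes(), 'nodes')
print(textrank_graph.number_of_edges(), 'edges')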
Step 5 - Train the networkx model.¶
text_rank = networkx.pagerank(textrank_graph, max_iter=5000)
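The trained model is simply a dictionary mapping each token (node) to its steady-state probability. Peeking at the highest-ranked tokens is a useful sanity check; the tokens shown will depend on the sample.
# Ten highest-scoring tokens; the actual tokens depend on the sampled articles.
top_tokens = sorted(text_rank.items(), key=lambda kv: kv[1], reverse=True)[:10]
for token, score in top_tokens:
    print('{:<20s} {:.5f}'.format(token, score))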
Step 6 - Aggregate Scores for each potential candidate -¶
def get_smart_dict(text_rank_model):
    # Sort tokens by their TextRank score in descending order.
    keywords = sorted(
        text_rank_model.items(),
        key=operator.itemgetter(1),
        reverse=True)
    scored_nodes_df = pd.DataFrame(
        keywords, columns=['smart_token', 'smart_token_score'])
    # Keep only tokens scoring above the 25th percentile.
    filtered_nodes_df = scored_nodes_df.loc[scored_nodes_df['smart_token_score'] > np.percentile(
        scored_nodes_df['smart_token_score'], 25)]
    smart_dict = filtered_nodes_df.set_index(
        'smart_token')['smart_token_score'].to_dict()
    return smart_dict
smart_dict = get_smart_dict(text_rank)
def compute_aggregated_scores(scored_smart_phrases, node_scores):
    # Score each candidate phrase as the sum of the TextRank scores of its
    # tokens. Phrases with fewer than two scored tokens are given a score of 0.
    scored_smart_tag_candidates = {}
    scored_token_keys = set(node_scores.keys())
    for smart_tag_candidate in scored_smart_phrases:
        smart_tokens = word_tokenize(smart_tag_candidate)
        intersected_tokens = set(smart_tokens).intersection(scored_token_keys)
        if len(intersected_tokens) >= 2:
            scores = np.sum([node_scores[token] for token in intersected_tokens])
        else:
            scores = 0
        scored_smart_tag_candidates[smart_tag_candidate] = scores
    return scored_smart_tag_candidates
scored_node_candidates = compute_aggregated_scores(unique_candidate_phrases, smart_dict)
# Rank candidate phrases by their aggregated score.
sorted_keywords = pd.DataFrame(
    sorted(scored_node_candidates.items(), key=operator.itemgetter(1), reverse=True),
    columns=['key_phrase', 'score'])
def get_phrase_length(keyphrase_string):
phrase_length = len(word_tokenize(keyphrase_string))
return phrase_length
sorted_keywords['phrase_length'] = sorted_keywords['key_phrase'].apply(get_phrase_length)
Step 7 - Scoop out the Creamy Layer.¶
Please note that this step is subject to the domain of application.
creamy_layer_phrases = sorted_keywords.loc[(sorted_keywords['phrase_length'] > 1)
& (sorted_keywords['phrase_length'] <= 4)
& (sorted_keywords['score'] > 0)]
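A quick glance at the surviving phrases (the exact values depend on the sample and the thresholds chosen above):
# Top surviving multi-word phrases after filtering.
display(creamy_layer_phrases.head(10))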
Step 8 - Reverse map key phrases to articles -¶
data.loc[:, 'key_phrases'] = ''
# Build a regex of look-aheads so that an article matches only if its cleaned
# text contains every token of the key phrase.
base = r'{}'
expr = '(?=.*{})'
for individual_phrase, smart_score in creamy_layer_phrases[['key_phrase', 'score']].values:
    smart_tokens = word_tokenize(individual_phrase)
    match_pattern = base.format(''.join(expr.format(w) for w in smart_tokens))
    filtered_indices = data[data['cleaned_text'].str.contains(match_pattern)].index
    if len(filtered_indices):
        data.loc[filtered_indices, 'key_phrases'] = data.loc[filtered_indices, 'key_phrases'] + "||{0}".format(individual_phrase)
Step 9 - Inspect sample Results¶
Inspecting results is an important step while developing algorithms, and one that is often neglected by ML/DS developers. Every domain of application has its own set of nuances, which makes it unlikely that any off-the-shelf algorithm will fit all use cases.
It is very important to perform a critical analysis of the results in order to make the algorithm more robust, accurate and scalable.
To keep this article within scope, I have already performed the necessary analysis and adjusted some of the thresholds (hyperparameters) in the above implementation to extract acceptable results.
Let's check out the first article in our dataset.
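One minimal way to pull it up:
# Print the raw content of the first sampled article.
print(data.loc[0, 'content'])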
WASHINGTON — Congressional Republicans have a new fear when it comes to their health care lawsuit against the Obama administration: They might win. The incoming Trump administration could choose to no longer defend the executive branch against the suit, which challenges the administration’s authority to spend billions of dollars on health insurance subsidies for and Americans, handing House Republicans a big victory on issues. But a sudden loss of the disputed subsidies could conceivably cause the health care program to implode, leaving millions of people without access to health insurance before Republicans have prepared a replacement. That could lead to chaos in the insurance market and spur a political backlash just as Republicans gain full control of the government. To stave off that outcome, Republicans could find themselves in the awkward position of appropriating huge sums to temporarily prop up the Obama health care law, angering conservative voters who have been demanding an end to the law for years. In another twist, Donald J. Trump’s administration, worried about preserving executive branch prerogatives, could choose to fight its Republican allies in the House on some central questions in the dispute. Eager to avoid an ugly political pileup, Republicans on Capitol Hill and the Trump transition team are gaming out how to handle the lawsuit, which, after the election, has been put in limbo until at least late February by the United States Court of Appeals for the District of Columbia Circuit. They are not yet ready to divulge their strategy. “Given that this pending litigation involves the Obama administration and Congress, it would be inappropriate to comment,” said Phillip J. Blando, a spokesman for the Trump transition effort. “Upon taking office, the Trump administration will evaluate this case and all related aspects of the Affordable Care Act. ” In a potentially decision in 2015, Judge Rosemary M. Collyer ruled that House Republicans had the standing to sue the executive branch over a spending dispute and that the Obama administration had been distributing the health insurance subsidies, in violation of the Constitution, without approval from Congress. The Justice Department, confident that Judge Collyer’s decision would be reversed, quickly appealed, and the subsidies have remained in place during the appeal. In successfully seeking a temporary halt in the proceedings after Mr. Trump won, House Republicans last month told the court that they “and the ’s transition team currently are discussing potential options for resolution of this matter, to take effect after the ’s inauguration on Jan. 20, 2017. ” The suspension of the case, House lawyers said, will “provide the and his future administration time to consider whether to continue prosecuting or to otherwise resolve this appeal. ” Republican leadership officials in the House acknowledge the possibility of “cascading effects” if the payments, which have totaled an estimated $13 billion, are suddenly stopped. Insurers that receive the subsidies in exchange for paying costs such as deductibles and for eligible consumers could race to drop coverage since they would be losing money. Over all, the loss of the subsidies could destabilize the entire program and cause a lack of confidence that leads other insurers to seek a quick exit as well. 
Anticipating that the Trump administration might not be inclined to mount a vigorous fight against the House Republicans given the ’s dim view of the health care law, a team of lawyers this month sought to intervene in the case on behalf of two participants in the health care program. In their request, the lawyers predicted that a deal between House Republicans and the new administration to dismiss or settle the case “will produce devastating consequences for the individuals who receive these reductions, as well as for the nation’s health insurance and health care systems generally. ” No matter what happens, House Republicans say, they want to prevail on two overarching concepts: the congressional power of the purse, and the right of Congress to sue the executive branch if it violates the Constitution regarding that spending power. House Republicans contend that Congress never appropriated the money for the subsidies, as required by the Constitution. In the suit, which was initially championed by John A. Boehner, the House speaker at the time, and later in House committee reports, Republicans asserted that the administration, desperate for the funding, had required the Treasury Department to provide it despite widespread internal skepticism that the spending was proper. The White House said that the spending was a permanent part of the law passed in 2010, and that no annual appropriation was required — even though the administration initially sought one. Just as important to House Republicans, Judge Collyer found that Congress had the standing to sue the White House on this issue — a ruling that many legal experts said was flawed — and they want that precedent to be set to restore congressional leverage over the executive branch. But on spending power and standing, the Trump administration may come under pressure from advocates of presidential authority to fight the House no matter their shared views on health care, since those precedents could have broad repercussions. It is a complicated set of dynamics illustrating how a quick legal victory for the House in the Trump era might come with costs that Republicans never anticipated when they took on the Obama White House.
Now let's inspect the key phrases extracted from the above article -¶
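They can be recovered by splitting the '||'-delimited key_phrases column (a minimal sketch); the list below shows the result.
# The key_phrases column is '||'-delimited; split it back into phrases.
for phrase in data.loc[0, 'key_phrases'].split('||'):
    if phrase:
        print(phrase)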
mr trump
proceedings after mr trump
election of mr trump
mr obama
house republicans last month
house committee reports republicans
deal between house republicans
new administration despite years
house republicans
obama white house
health insurance before republicans
administration desperate
american donald j trump
obama health care law
many people
case house lawyers
effects of people
health care program
future administration time
incoming trump administration
trump administration
last year
last half
trump era
trump transition effort
white house
trump transition team
obama administration
health care law
law for years
new administration
district of columbia circuit
health care lawsuit
health care systems
judge rosemary m collyer
house speaker
affordable care act
general public
american president
views on health care
full control
many legal experts
health insurance subsidies
donald j
united states laws
new congress
deal cut
american officials
team of lawyers
united states
executive branch prerogatives
new fear
widespread internal skepticism
quick legal victory
health insurance
annual appropriation
treasury department
insurance market
republican allies
east side
executive branch
Concluding Comments -¶
As can be seen from the results, the extracted key phrases are quite accurate and do capture the essence of the article.
Sure, there are a few misses with phrases like "last year" and "last half". Again, extraction of such phrases is subjective and can be suppressed by tweaking thresholds and hyperparameters. For example, I have extracted all the phrases with score > 0; by being more aggressive with this threshold, one can suppress false positives.
This is by no means the only solution out there for extractive text modelling. In fact, there is so much to play around with. Each step of the pseudo code has been truncated / simplified to bring out the hero - PageRank.
Finally, I hope this article helps in establishing a baseline and provides inspiration for everyone trying to solve this problem in their respective domains.
Further Readings -